Reduce_mul implementation #1132

Closed
emrys53 wants to merge 2 commits into xtensor-stack:master from emrys53:master

Conversation

@emrys53 (Contributor) commented Jul 1, 2025

While using XSIMD in our library, we realized that a dedicated reduce_mul function was missing. This PR introduces a reduce_mul implementation for all architectures and supported types, with architecture-specific optimizations for SSE, AVX, and AVX512 instruction sets. Relevant unit tests are included.

I was unable to test the AVX512-specific implementation on my machine, which only supports AVX2. For AVX512, I used the _mm512_reduce_mul_{type} intrinsics. According to Intel's documentation, these are sequential rather than parallel. Since I couldn't benchmark or compare alternative implementations, I left it as-is. I'm open to suggestions for improvement.
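
For illustration, a minimal sketch of an AVX512 path built on one of those intrinsics; the wrapper name is hypothetical and this is not the PR's code:

#include <immintrin.h>

// Hedged sketch: _mm512_reduce_mul_ps is one of the intrinsics mentioned
// above; Intel documents it as a sequential helper that expands to a
// series of operations rather than a single instruction.
static inline float reduce_mul_f32_avx512(__m512 v)
{
    return _mm512_reduce_mul_ps(v);
}
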
Motivation

Although XSIMD provides a general reduce function, using it with std::complex<float> or std::complex<double> fails, because as_unsigned_integer_t is not specialized for complex types. I'm not sure whether this is intended behavior, but I wanted to mention it for context. Providing reduce_mul directly avoids this issue.
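
A minimal sketch of the motivating use case, assuming the reduce_mul signature this PR proposes (batch in, scalar out); xsimd::reduce_mul here is the proposed API, not an existing library function:

#include <complex>
#include <xsimd/xsimd.hpp>

// Hedged sketch of the complex-valued use case described above.
std::complex<float> product(xsimd::batch<std::complex<float>> const& b)
{
    // The generic reduce path fails for complex batches because
    // as_unsigned_integer_t has no specialization for std::complex;
    // a dedicated reduce_mul sidesteps that machinery.
    return xsimd::reduce_mul(b);
}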

This is my first contribution to XSIMD, and I’ve only been using the library for about a week. Feedback is very welcome, and I’m happy to adjust or improve the code based on your review.


// hmul
template <class A, class T, class /*=typename std::enable_if<std::is_integral<T>::value, void>::type*/>
XSIMD_INLINE T hmul(batch<T, A> const& self, requires_arch<common>) noexcept
Contributor

Shouldn't this just be the generic implementation of reduce_mul? And if so, why not use the current generic reduction implementation?

Contributor Author

This function is only used for integral types with sizeof(T) < 4. I implemented it this way because of what happens when testing reduce_mul for uint8_t on an AVX machine: the batch has 32 elements, so the generic reduce function multiplies and swizzles the batch five times (32->16->8->4->2->1) and then calls self.get(0), which is itself too inefficient (see issue #1133). With uint8_t, my implementation instead first splits the batch into two equal sse4_2 batches, multiplies them together, and then calls hmul.
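
A hedged sketch of that split-then-reduce shape, written with raw intrinsics rather than xsimd's internal helpers, and shown for int32_t (which has a packed multiply) instead of the PR's uint8_t case:

#include <immintrin.h>
#include <cstdint>

// Illustrative only: one extract splits the 256-bit batch into halves,
// then the reduction finishes on 128 bits (4 -> 2 -> 1 lanes).
static inline int32_t reduce_mul_i32_avx2(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);      // lanes 0..3
    __m128i hi = _mm256_extracti128_si256(v, 1); // lanes 4..7
    __m128i p  = _mm_mullo_epi32(lo, hi);        // 4 partial products
    p = _mm_mullo_epi32(p, _mm_shuffle_epi32(p, _MM_SHUFFLE(1, 0, 3, 2)));
    p = _mm_mullo_epi32(p, _mm_shuffle_epi32(p, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(p);                 // product of all 8 lanes
}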

@emrys53 (Contributor, Author) Jul 1, 2025

Also, the swizzle function for uint8_t just stores the batch to a buffer and loads the result back, permuting in scalar code in between. Using that five times (on AVX) is far less efficient than the split_avx function, which splits the batch using SIMD intrinsics.
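
A hedged schematic of that buffer round-trip (illustrative, not xsimd's actual swizzle kernel):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Store the 32 uint8_t lanes to memory, permute with scalar code, and
// load the result back; this memory round-trip is the cost the comment
// is pointing at.
static inline __m256i swizzle_u8_via_buffer(__m256i v, const uint8_t (&idx)[32])
{
    alignas(32) uint8_t in[32], out[32];
    _mm256_store_si256(reinterpret_cast<__m256i*>(in), v);
    for (std::size_t i = 0; i < 32; ++i)
        out[i] = in[idx[i]];
    return _mm256_load_si256(reinterpret_cast<const __m256i*>(out));
}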

Contributor Author

@serge-sans-paille you haven't responded yet. If you are open to the idea of a reduce_mul implementation, I can fix the code so that it passes all the tests. If you have feedback or recommendations on the code I have written, post them here and I will address them after we discuss.

@DiamonDinoia (Contributor)

I'm also interested in this. @emrys53 could you rebase onto master?
@serge-sans-paille what do you think needs to change to integrate it?

serge-sans-paille added a commit that referenced this pull request Aug 20, 2025
It changes nothing for clang which has a nice shuffle optimizer, but it
does help gcc a lot.

Somehow related to #1132
serge-sans-paille added a commit that referenced this pull request Aug 23, 2025
This is a generalization of #1132 by @emrys53. Part of the Intel code is
strongly inspired by the work from #1132, with some minor nits.
@serge-sans-paille (Contributor)

@emrys53 I've submitted #1163 as a port of this patch to other targets, along with support for additional architectures. I was strongly motivated and inspired by your work, so please take it not as an offense but as a tribute!

@JohanMabille (Member)

Closing this in favor of #1163, thanks a lot @emrys53 for initiating the work on this!
